Abstract

Expectation maximization (EM) has recently been shown to be an efficient algorithm 
for learning finite-state controllers (FSCs) in large decentralized POMDPs (Dec-POMDPs). 
However, current methods use fixed-size FSCs and often converge to maxima that are far 
from optimal. This paper considers a variable-size FSC to represent the local policy of each 
agent. These variable-size FSCs are constructed using a stick-breaking prior, leading to a new
framework called decentralized stick-breaking policy representation (Dec-SBPR). This approach
learns the controller parameters with a variational Bayesian algorithm without having to assume 
that the Dec-POMDP model is available. The performance of Dec-SBPR is demonstrated 
on several benchmark problems, showing that the algorithm scales to large problems while 
outperforming other state-of-the-art methods. 


1 Introduction 

Decentralized partially observable Markov decision processes (Dec-POMDPs) [25] provide a
general framework for solving the cooperative multiagent sequential decision-making problems
that arise in numerous applications, including robotic soccer [24], transportation [5], extraplanetary
exploration, and traffic control [35]. A Dec-POMDP can be viewed as a POMDP controlled
by multiple distributed agents. These agents make decisions based on their own local streams
of information (i.e., observations), and their joint actions control the global state dynamics and
the expected reward of the team. Because of the decentralized decision-making, an individual
agent generally does not have enough information to compute the global belief state, which is a
sufficient statistic for decision making in POMDPs. This makes generating an optimal solution in a
Dec-POMDP more difficult than for a POMDP [10], especially for long planning horizons.

To circumvent the difficulty of solving long-horizon Dec-POMDPs optimally, while still generating
a high-quality policy, this paper presents scalable learning methods using a finite-memory policy
representation. For infinite-horizon problems (which continue for an infinite number of steps),
significant progress has been made with agent policies represented as finite-state controllers (FSCs) that
map observation histories to actions [9, 2]. Recent work has shown that expectation-maximization
(EM) [14] is a scalable method for generating controllers for large Dec-POMDPs [19, 27]. In
addition, EM has also been shown to be an efficient algorithm for policy-based reinforcement
learning (RL) in Dec-POMDPs, where agents learn FSCs based on trajectories, without knowing or
learning the Dec-POMDP model [35].

An important and yet unanswered question is how to define an appropriate number of nodes 
in each FSC. Previous work assumes a fixed FSC size for each agent, but the number of nodes 
affects both the quality of the policies and the convergence rate. When the number of nodes is too 
small, the FSC is unable to represent the optimal policy and therefore will quickly converge to a 
sub-optimal result. By contrast, when the number is too large, the FSC overfits data, often yielding 
slow convergence and, again, a sub-optimal policy. 

This paper uses a Bayesian nonparametric approach to determine the appropriate controller 
size in a variable-size FSC. Following previous methods [35, 25], learning is assumed to be
centralized, and execution is decentralized. That is, learning is accomplished offline based on all
available information, but the optimization is only over decentralized solutions. Such a controller is
constructed using the stick-breaking (SB) prior. The SB prior allows the number of nodes to be
variable, but the set of nodes that is actively used by the controller is encouraged to be compact. 
The nodes that are actually used are determined by the posterior, combining the SB prior and the 
information from trajectory data. The framework is called the decentralized stick-breaking policy 
representation (Dec-SBPR) to recognize the role of the SB prior. 

In addition to the use of variable-size FSCs, the paper also makes several other contributions.
Specifically, our algorithm directly operates on the (shifted) empirical value function of Dec-POMDPs,
which is simpler than the likelihood functions (a mixture of dynamic Bayes nets (DBNs))
in existing planning-as-inference frameworks [19, 35]. Moreover, we derive a variational Bayesian
(VB) algorithm for learning the Dec-SBPR based only on the agents' trajectories (or episodes)
of actions, observations, and rewards. The VB algorithm is linear in the number of agents and
at most quadratic in the problem size, and is therefore scalable to large application domains. In
practice, these trajectories can be generated by a simulator or taken from a set of real-world
experiences. This batch data scenario is general and realistic, as it is widely adopted in learning
from demonstration [23] and reinforcement learning. To the best of our knowledge, this is the
first application of Bayesian nonparametric methods to the difficult and little-studied problem
of policy-based RL in Dec-POMDPs, and the proposed method is able to generate high-quality
solutions for large problems.


2 Background and Related Work 

Before introducing the proposed method, we first describe the Dec-POMDP model and some related 
work. 

2.1 Decentralized POMDPs


A Dec-POMDP can be represented as a tuple $M = \langle \mathcal{N}, \mathcal{A}, \mathcal{S}, \mathcal{O}, \mathcal{T}, \Omega, \mathcal{R}, \gamma \rangle$, where
$\mathcal{N} = \{1, \cdots, N\}$ is a finite set of agent indices; $\mathcal{A} = \otimes_n \mathcal{A}_n$ and $\mathcal{O} = \otimes_n \mathcal{O}_n$ respectively are sets
of joint actions and observations, with $\mathcal{A}_n$ and $\mathcal{O}_n$ available to agent $n$. At each step, a joint action
$\vec{a} = (a_1, \cdots, a_N) \in \mathcal{A}$ is selected and a joint observation $\vec{o} = (o_1, \cdots, o_N)$ is received; $\mathcal{S}$ is a
finite set of world states; $\mathcal{T} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the state transition function with $\mathcal{T}(s' | s, \vec{a})$ denoting
the probability of transitioning to $s'$ after taking joint action $\vec{a}$ in $s$; $\Omega : \mathcal{S} \times \mathcal{A} \times \mathcal{O} \to [0, 1]$ is
the observation function with $\Omega(\vec{o} | s', \vec{a})$ the probability of observing $\vec{o}$ after taking joint action $\vec{a}$
and arriving in state $s'$; $\mathcal{R} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function with $r(s, \vec{a})$ the immediate reward
received after taking joint action $\vec{a}$ in $s$; $\gamma \in [0, 1)$ is a discount factor. A global reward signal is
generated for the team of agents after joint actions are taken, but each agent only observes its local
observation. Because each agent lacks access to other agents' observations, each agent maintains
a local policy $\Psi_n$, defined as a mapping from local observation histories to actions. A joint policy
consists of the local policies of all agents. For an infinite-horizon Dec-POMDP with initial belief
state $b_0$, the objective is to find a joint policy $\Psi = \otimes_n \Psi_n$ such that the value of $\Psi$ starting from $b_0$,
$V^{\Psi}(b(s_0)) = E\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, \vec{a}_t) \mid b_0, \Psi\right]$, is maximized.

An FSC is a compact way to represent a policy as a mapping from histories to actions. Formally,
a stochastic FSC for agent $n$ is defined as a tuple $\Theta_n = \langle \mathcal{A}_n, \mathcal{O}_n, \mathcal{Z}_n, \mu_n, W_n, \pi_n \rangle$, where $\mathcal{A}_n$ and
$\mathcal{O}_n$ are the same as defined in the Dec-POMDP; $\mathcal{Z}_n$ is a finite set of controller nodes for agent $n$;
$\mu_n$ is the initial node distribution with $\mu_n^z$ the probability of agent $n$ initially being in $z$; $W_n$ is a set
of Markov transition matrices with $W_{n,a,o}^{z,z'}$ denoting the probability of the controller node transiting
from $z$ to $z'$ when agent $n$ takes action $a$ in $z$ and sees observation $o$; $\pi_n$ is a set of stochastic policies
with $\pi_{n,z}^a$ the probability of agent $n$ taking action $a$ in $z$.

For simplicity, we use the following notational conventions. $\mathcal{Z}_n = \{1, 2, \cdots, C_n\}$, where
$C_n \overset{\text{def}}{=} |\mathcal{Z}_n|$ is the cardinality of $\mathcal{Z}_n$, and $\mathcal{A}_n$ and $\mathcal{O}_n$ follow similarly. $\Theta = \{\Theta_1, \cdots, \Theta_N\}$ is
the joint FSC of all agents. A consecutively-indexed variable is abbreviated as the variable with
the index range shown in the subscript or superscript; when the index range is obvious from the
context, a simple ":" is used instead. Thus, $a_{n,0:T} = (a_{n,0}, a_{n,1}, \ldots, a_{n,T})$ represents the actions of
agent $n$ from step 0 to $T$ and $W_{n,a,o}^{z,:} = (W_{n,a,o}^{z,1}, W_{n,a,o}^{z,2}, \cdots, W_{n,a,o}^{z,|\mathcal{Z}_n|})$ represents the node transition
probabilities for agent $n$ when starting in node $z$, taking action $a$ and seeing observation $o$. Given
$h_{n,t} = \{a_{n,0:t-1}, o_{n,1:t}\}$, a local history of actions and observations up to step $t$, as well as an agent
controller $\Theta_n$, we can calculate a local policy $p(a_{n,t} | h_{n,t}, \Theta_n)$, the probability that agent $n$ chooses
its action $a_{n,t}$.
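
As an illustration of how the local policy $p(a_{n,t} | h_{n,t}, \Theta_n)$ can be evaluated, the following minimal Python sketch tracks the distribution over controller nodes along a history and then marginalizes over the current node. The function name, the nested-list layout and the dictionary keyed by (action, observation) pairs are illustrative assumptions, not part of the paper.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def action_probability(
    mu: Sequence[float],
    pi: Matrix,
    W: Dict[Tuple[int, int], Matrix],
    actions: Sequence[int],
    observations: Sequence[int],
) -> float:
    """Return p(a_t | h_t, Theta_n) for the last action in `actions`.

    mu[z]            initial node probability
    pi[z][a]         probability of taking action a in node z
    W[(a, o)][z][y]  probability of moving from node z to node y after (a, o)
    actions          a_0, ..., a_t
    observations     o_1, ..., o_t (one entry fewer than `actions`)
    """
    if len(actions) != len(observations) + 1:
        raise ValueError("need exactly one more action than observations")
    num_nodes = len(mu)
    belief = list(mu)  # distribution over z_0
    for a_prev, obs in zip(actions[:-1], observations):
        # Weight each node by how likely it was to emit the action actually taken.
        weighted = [belief[z] * pi[z][a_prev] for z in range(num_nodes)]
        norm = sum(weighted)
        if norm <= 0.0:
            raise ValueError("history has zero probability under this controller")
        trans = W[(a_prev, obs)]
        # Propagate through the node transition and renormalize.
        belief = [
            sum(weighted[z] * trans[z][y] for z in range(num_nodes)) / norm
            for y in range(num_nodes)
        ]
    # Marginalize the action probability over the current node.
    return sum(belief[z] * pi[z][actions[-1]] for z in range(num_nodes))
```

The recursion costs on the order of $|\mathcal{Z}_n|^2$ operations per step and never refers to the world state, which is why a controller can be evaluated on trajectories alone.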

2.2 Planning as Inference in Dec-POMDPs 

A Dec-POMDP planning problem can be transformed into an inference problem and then efficiently
solved by EM algorithms. The validity of this method rests on the following fact: by introducing binary
rewards $R$ such that $P(R = 1 | \vec{a}, s) \propto r(\vec{a}, s)$, $\forall \vec{a} \in \mathcal{A}, s \in \mathcal{S}$, and choosing the geometric time prior
$p(T) = \gamma^T (1 - \gamma)$, maximizing the likelihood $L(\Theta) = P(R = 1; \Theta) = \sum_{T=0}^{\infty} p(T) P(R = 1 | T; \Theta)$
of a mixture of dynamic Bayes nets is equivalent to optimizing the associated Dec-POMDP policy,
as the joint-policy value $V(\Theta)$ and $L(\Theta)$ can be related through an affine transform [19]:

$$L(\Theta) = \frac{(1 - \gamma)\left(V(\Theta) - \sum_T \gamma^T R_{\min}\right)}{R_{\max} - R_{\min}} = \frac{(1 - \gamma)\hat{V}(\Theta)}{R_{\max} - R_{\min}}, \qquad (1)$$

where $R_{\max}$ and $R_{\min}$ are the maximum and minimum reward values and $\hat{V}(\Theta) = V(\Theta) - \sum_T \gamma^T R_{\min}$
is a shifted value.
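
The affine relation in (1) can be checked numerically. The sketch below is a minimal illustration that assumes an infinite horizon, so that $\sum_T \gamma^T R_{\min} = R_{\min} / (1 - \gamma)$; the function name and argument names are hypothetical.

```python
def likelihood_from_value(value: float, gamma: float, r_min: float, r_max: float) -> float:
    """Map a joint-policy value V(Theta) to the likelihood L(Theta) of equation (1).

    Assumes an infinite horizon, so sum_T gamma^T * R_min = R_min / (1 - gamma).
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if r_max <= r_min:
        raise ValueError("r_max must exceed r_min")
    shifted_value = value - r_min / (1.0 - gamma)  # the shifted value in (1)
    return (1.0 - gamma) * shifted_value / (r_max - r_min)
```

Under this assumption, a policy that always earns $R_{\max}$ maps to a likelihood of 1 and one that always earns $R_{\min}$ maps to 0, so maximizing the value and maximizing the likelihood select the same policy.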

Previous EM methods [19, 27] have achieved success in scaling to larger problems by factoring
the distribution over states and histories for inference, but these methods require using a
Dec-POMDP model to construct a Bayes net for policy evaluation. When the exact model parameters
$\mathcal{T}$, $\Omega$ and $\mathcal{R}$ are unknown, one needs to solve a reinforcement learning (RL) problem. To address
this important yet less-studied problem, a global empirical value function, extended from the
single-agent case [20], is constructed based on all the action, observation and reward trajectories,
and the product of local policies of all agents. This serves as the basis for learning (fixed-size) FSCs
in RL settings.


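Definition 1. (Global empirical value function) Let $\mathcal{D}^{(K)} = \{(\vec{o}_0^k \vec{a}_0^k r_0^k \vec{o}_1^k \vec{a}_1^k r_1^k \cdots \vec{o}_{T_k}^k \vec{a}_{T_k}^k r_{T_k}^k)\}_{k=1,\cdots,K}$
be a set of episodes resulting from $N$ agents who choose actions according to $\Psi = \otimes_n \Psi_n$, a set
of stochastic behavior policies with $p^{\Psi_n}(a | h) > 0$, $\forall$ action $a$, $\forall$ history $h$. The global empirical
value function is defined as

$$\hat{V}(\mathcal{D}^{(K)}; \Theta) \overset{\text{def}}{=} \sum_{k=1}^{K} \sum_{t=0}^{T_k} \frac{\gamma^t (r_t^k - R_{\min}) \prod_{\tau=0}^{t} \prod_{n=1}^{N} p(a_{n,\tau}^k | h_{n,\tau}^k, \Theta_n)}{K \prod_{\tau=0}^{t} \prod_{n=1}^{N} p^{\Psi_n}(a_{n,\tau}^k | h_{n,\tau}^k)},$$

where $h_{n,t}^k = (a_{n,0:t-1}^k, o_{n,1:t}^k)$ and $0 \le \gamma < 1$ is the discount.

According to the strong law of large numbers [32], $\hat{V}(\Theta) = \lim_{K \to \infty} \hat{V}(\mathcal{D}^{(K)}; \Theta)$, i.e., with a
large number of trajectories, the empirical value function $\hat{V}(\mathcal{D}^{(K)}; \Theta)$ approximates $\hat{V}(\Theta)$ accurately.
Hence, applying (1), $\hat{V}(\mathcal{D}^{(K)}; \Theta)$ approximates $L(\Theta)$ up to the positive scale factor in (1), and offers
an objective for learning the decentralized policies that can be directly maximized by the EM algorithms in [20].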
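
To make the importance-weighted form of the global empirical value function concrete, here is a minimal Python sketch. It assumes that each step of an episode has already been reduced to a triple (reward, joint probability of the taken actions under the evaluated controllers, joint probability under the behavior policies), and that rewards are shifted by the minimum reward as in the shifted value of (1); the data layout and names are illustrative only.

```python
from typing import List, Sequence, Tuple

Step = Tuple[float, float, float]  # (reward, p_target, p_behavior)


def empirical_value(episodes: Sequence[List[Step]], gamma: float, r_min: float) -> float:
    """Importance-weighted, shifted, discounted return averaged over K episodes.

    p_target   = prod_n p(a_{n,t} | h_{n,t}, Theta_n)   (evaluated controllers)
    p_behavior = prod_n p^{Psi_n}(a_{n,t} | h_{n,t})    (data-collection policies)
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for episode in episodes:
        ratio = 1.0     # cumulative product over tau = 0..t
        discount = 1.0  # gamma ** t
        for reward, p_target, p_behavior in episode:
            if p_behavior <= 0.0:
                raise ValueError("behavior policy must give positive probability")
            ratio *= p_target / p_behavior
            total += discount * (reward - r_min) * ratio
            discount *= gamma
    return total / len(episodes)
```

When the evaluated controllers coincide with the behavior policies, every ratio equals one and the estimate reduces to the average shifted discounted return of the data.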


3 Bayesian Learning of Policies 

Because EM algorithms infer policies based only on a fixed-size representation and the observed data,
it is difficult to explicitly handle model uncertainty and encode prior (or expert) knowledge. To address these
issues, a Bayesian learning method is proposed in this section. This is accomplished by measuring
the likelihood of $\Theta$ using $L(\mathcal{D}^{(K)}; \Theta)$, which is combined with the prior $p(\Theta)$ in Bayes' rule to
yield the posterior

$$p(\Theta | \mathcal{D}^{(K)}) = L(\mathcal{D}^{(K)}; \Theta) \, p(\Theta) \, [p(\mathcal{D}^{(K)})]^{-1}, \qquad (2)$$

where $p(\mathcal{D}^{(K)})$ is the marginal likelihood of the joint FSC and, up to an additive constant, proportional
to the marginal value function,

$$\hat{V}(\mathcal{D}^{(K)}) \overset{\text{def}}{=} \int \hat{V}(\mathcal{D}^{(K)}; \Theta) \, p(\Theta) \, d\Theta \propto \int L(\mathcal{D}^{(K)}; \Theta) \, p(\Theta) \, d\Theta = p(\mathcal{D}^{(K)}). \qquad (3)$$

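To compute the posterior $p(\Theta | \mathcal{D}^{(K)})$, Markov chain Monte Carlo (MCMC) simulation [32] is the
most straightforward method. However, MCMC is costly in terms of computation and storage, and
lacks a strong convergence guarantee. An alternative is a variational Bayes (VB) method, which
performs approximate posterior inference by minimizing the Kullback-Leibler (KL) divergence
between the true and approximate posterior distributions. Because the VB method has a (local)
convergence guarantee and is able to trade off scalability and accuracy, we focus on the derivation
of the VB method here. Denoting $g(\Theta)$ as the variational approximation to $p(\Theta | \mathcal{D}^{(K)})$, and $q_t^k(\vec{z}_{0:t}^k)$ as
the approximation to $p(\vec{z}_{0:t}^k | \vec{a}_{0:t}^k, \vec{o}_{1:t}^k, \Theta)$, a VB objective function¹ is

$$\mathrm{KL}\left(\{q_t^k(\vec{z}_{0:t}^k) g(\Theta)\}_{k=1:K} \,\|\, \{\nu_t^k \, p(\vec{z}_{0:t}^k, \Theta | \vec{a}_{0:t}^k, \vec{o}_{1:t}^k)\}_{k=1:K}\right) = \ln \hat{V}(\mathcal{D}^{(K)}) - \mathrm{LB}(\{q_t^k(\vec{z}_{0:t}^k)\}, g(\Theta)), \qquad (4)$$

where

$$\mathrm{LB}(\{q_t^k(\vec{z}_{0:t}^k)\}, g(\Theta)) \overset{\text{def}}{=} \sum_{k, t, \vec{z}_{0:t}^k} \int \frac{q_t^k(\vec{z}_{0:t}^k) g(\Theta)}{K} \ln \frac{\nu_t^k \, \hat{V}(\mathcal{D}^{(K)}) \, p(\vec{z}_{0:t}^k, \Theta | \vec{a}_{0:t}^k, \vec{o}_{1:t}^k)}{q_t^k(\vec{z}_{0:t}^k) g(\Theta)} \, d\Theta \qquad (5)$$

is the lower bound of $\ln \hat{V}(\mathcal{D}^{(K)})$ and

$$\nu_t^k \overset{\text{def}}{=} \frac{\gamma^t (r_t^k - R_{\min}) \prod_{n=1}^{N} p(a_{n,0:t}^k | o_{n,1:t}^k)}{\prod_{n=1}^{N} \prod_{\tau=0}^{t} p^{\Psi_n}(a_{n,\tau}^k | h_{n,\tau}^k) \, \hat{V}(\mathcal{D}^{(K)})} \qquad (6)$$

is the re-weighted reward. Since $\ln \hat{V}(\mathcal{D}^{(K)})$ in equation (4) is independent of $\Theta$ and $\{q_t^k(\vec{z}_{0:t}^k)\}$,
minimizing the KL divergence is equivalent to maximizing the lower bound, leading to the following
constrained optimization problem,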


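$$
\begin{aligned}
\max_{\{q_t^k\},\, g(\Theta)} \quad & \mathrm{LB}(\{q_t^k(\vec{z}_{0:t}^k)\}, g(\Theta)) \\
\text{subject to:} \quad & q_t^k(\vec{z}_{0:t}^k) \, g(\Theta) = \prod_{n=1}^{N} q_t^k(z_{n,0:t}^k) \, g(\Theta_n), \\
& \sum_{k=1}^{K} \sum_{t=0}^{T_k} \sum_{\vec{z}_{0:t}^k} q_t^k(\vec{z}_{0:t}^k) = K, \quad q_t^k(\vec{z}_{0:t}^k) \ge 0, \ \forall \vec{z}_{0:t}^k, t, k, \\
& \int g(\Theta) \, d\Theta = 1 \ \text{and} \ g(\Theta) \ge 0, \ \forall \Theta,
\end{aligned}
\qquad (7)
$$

where the constraint in the second line arises both from the mean-field approximation and from the
decentralized policy representation, and the last two lines summarize the normalization constraints.
It is worth emphasizing that we developed this variational mean-field approximation to optimize a
decentralized policy representation, showing that the VB learning problem formulation (7) is both a
general and accurate method for the multiagent problem considered in this paper.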


3.1 Stick-breaking Policy Priors 

To solve the Bayesian learning problem described above and obtain the variable-size FSCs, the 
stick-breaking prior is used to specify the policy’s structure. As such, Dec-SBPR is formally given 
in Definition 2.

1 Refer to the appendix for derivation details 
Definition 2. The decentralized stick-breaking policy representation (Dec-SBPR) is a tuple
$\langle \mathcal{N}, \mathcal{A}, \mathcal{O}, \mathcal{Z}, \mu, \eta, \rho \rangle$, where $\mathcal{N}$, $\mathcal{A}$ and $\mathcal{O}$ are as in the definition of the Dec-POMDP; $\mathcal{Z}$ is an unbounded set of
nodes indexed by positive integers; for notational simplicity, $\mu$ is assumed to be deterministic
with $\mu_n^1 = 1$, $\mu_n^{2:\infty} = 0$, $\forall n$; $(\eta, \rho)$ determine $(W, \pi)$, the FSC parameters defined in section 2.1, as
follows:

$$W_{n,a,o}^{i,1:\infty} \sim \mathrm{SB}(\sigma_{n,a,o}^{i,1:\infty}, \eta_{n,a,o}^{i,1:\infty}), \qquad \pi_{n,i}^{1:|\mathcal{A}_n|} \sim \mathrm{Dir}(\rho_{n,i}^{1:|\mathcal{A}_n|}), \qquad (8)$$

where Dir represents the Dirichlet distribution and SB represents the stick-breaking process with
$W_{n,a,o}^{i,j} = V_{n,a,o}^{i,j} \prod_{m=1}^{j-1} (1 - V_{n,a,o}^{i,m})$, $V_{n,a,o}^{i,j} \sim \mathrm{Beta}(\sigma_{n,a,o}^{i,j}, \eta_{n,a,o}^{i,j})$, $\eta_{n,a,o}^{i,j} \sim \mathrm{Gamma}(c, d)$, $n = 1, \cdots, N$
and $i, j = 1, \cdots, \infty$.

Dec-SBPR differs from previous nonparametric Bayesian RL methods [16, 21]. Specifically,
Dec-SBPR performs policy-based RL and generalizes the nonparametric Bayesian policy representation
of POMDPs [21] to the decentralized domain. In contrast, [16] is a model-based RL method
that does not assume knowledge of the world model, but explicitly learns it and then performs
planning. Moreover, Dec-SBPR is further distinguished from previous methods [16, 21] by the prior
distributions and inference methods employed. These previous methods employed hierarchical
Dirichlet process hidden Markov models (HDP-HMM) to infer the number of controller nodes.
However, due to the lack of conjugacy between the two levels of DPs in the HDP-HMM, a fully
conjugate Bayesian variational inference does not exist.³ Therefore, these methods used MCMC, which
requires high computational and storage costs, making them not ideal for solving large problems.
In contrast, Dec-SBPR employs single-layer SB priors over FSC transition matrices $W$ and sparse
Gamma priors over SB weight hyperparameters $\eta$ to bias transitions toward nodes with smaller
indices. A similar framework has been explored to infer HMMs, and we refer readers to [26] for
more details.

It is worth noting that SB processes subsume Dirichlet processes (DPs) [11] as a special case,
when $\sigma_{n,a,o}^{i,j} = 1$, $\forall i, j, n, a, o$ (in Dec-SBPR). The purpose of using SB priors is to encourage a
small number of FSC nodes. Compared to a DP, the SB priors can represent richer patterns of
sparse transition between the nodes of an FSC, because they allow arbitrary correlation between the
stick-breaking weights (the weights are always negatively correlated in a DP).
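
The stick-breaking construction behind these priors can be sketched in a few lines of Python. This is a minimal illustration under a finite truncation level; setting the last stick fraction to 1 so that the truncated weights sum to one is a common convention assumed here, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def stick_breaking_weights(fractions: Sequence[float]) -> List[float]:
    """Turn stick fractions V_1..V_J into weights W_j = V_j * prod_{m<j} (1 - V_m)."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for v in fractions:
        if not 0.0 <= v <= 1.0:
            raise ValueError("stick fractions must lie in [0, 1]")
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


def sample_transition_row(sigma: Sequence[float], eta: Sequence[float],
                          rng: random.Random) -> List[float]:
    """Sample one truncated row of node-transition probabilities.

    Each fraction is drawn as V_j ~ Beta(sigma_j, eta_j); the last fraction is
    set to 1 (truncation assumption) so that the row sums to one.
    """
    fractions = [rng.betavariate(s, e) for s, e in zip(sigma, eta)]
    fractions[-1] = 1.0
    return stick_breaking_weights(fractions)
```

With small values of $\eta$ the early fractions tend to be close to one, so most of the probability mass falls on low-index nodes, which is the bias toward compact controllers described above.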


3.2 Variational Stick-breaking Policy Inference 

It is shown in [18] that the random weights constructed by the SB prior are equivalently governed
by a generalized Dirichlet distribution (GDD) and are therefore conjugate to the multinomial
distribution; hence an efficient variational Bayesian algorithm for learning the decentralized policies
can be derived. To accommodate an unbounded number of nodes, we apply the retrospective
representation of SB priors [28] to the Dec-SBPR. For agent $n$, the SB prior is set with a truncation
level $|\mathcal{Z}_n|$, taking into account the current occupancy as well as additional nodes reserved for future
new occupancies. The solution to (7) under the stick-breaking priors is given in Theorem 3, the
proof of which is provided in the appendix.

2 Nonparametric priors over $\eta$ can also be used.

3 The VB method in [12] imposes point-mass proposals over top-level DPs, lacking an uncertainty measure.
Algorithm 1: Batch VB Inference for Dec-SBPR

 1: Input: Episodes $\mathcal{D}^{(K)}$, the number of agents $N$, initial policies $\Theta$, VB lower bound $\mathrm{LB}_0 = -\mathrm{Inf}$, $\Delta\mathrm{LB} = 1$,
    $\mathrm{Iter} = 0$;
 2: while $\Delta\mathrm{LB} > 10^{-3}$ do
 3:     for $k = 1$ to $K$, $n = 1$ to $N$ do
 4:         Update the global rewards $\{\nu_t^k\}$ using (9).
 5:         Compute $\{\alpha_\tau^{n,k}\}$ and $\{\beta_{t,\tau}^{n,k}\}$.
 6:     end for
 7:     $\mathrm{Iter} = \mathrm{Iter} + 1$.
 8:     Compute $\mathrm{LB}_{\mathrm{Iter}}$ using (5).
 9:     $\Delta\mathrm{LB} = (\mathrm{LB}_{\mathrm{Iter}} - \mathrm{LB}_{\mathrm{Iter}-1}) / |\mathrm{LB}_{\mathrm{Iter}-1}|$
10:     for $n = 1$ to $N$ do
11:         Compute $\{\xi_{t,\tau}^{n,k}(i, j)\}$ and $\{\phi_{t,\tau}^{n,k}(i)\}$ using (11).
12:         Update the hyper-parameters of $\Theta_n$ using (10).
13:         Compute $|\mathcal{Z}_n|$ using (13).
14:     end for
15: end while
16: Return: Policies $\{\Theta_n\}_{n=1}^{N}$, and controller sizes $\{|\mathcal{Z}_n|\}_{n=1}^{N}$.
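
The stopping rule in lines 2 and 9 of Algorithm 1 can be sketched as a generic loop. This is a minimal Python illustration, not the authors' implementation: `update_step` is a hypothetical callable that performs one full sweep of updates and returns the new lower bound, and the `max_iter` safeguard is an added assumption.

```python
import math
from typing import Callable, List


def run_until_converged(update_step: Callable[[], float],
                        tol: float = 1e-3,
                        max_iter: int = 1000) -> List[float]:
    """Repeat `update_step` until the relative lower-bound gain drops to `tol`.

    Returns the sequence of lower-bound values, one per completed iteration.
    """
    lb_prev = -math.inf  # LB_0 = -Inf, as in the algorithm's input line
    history: List[float] = []
    for _ in range(max_iter):
        lb = update_step()
        history.append(lb)
        # The relative change is undefined against -Inf or 0, so skip the test then.
        if math.isfinite(lb_prev) and lb_prev != 0.0:
            delta = (lb - lb_prev) / abs(lb_prev)
            if delta <= tol:
                break
        lb_prev = lb
    return history
```

Because the lower bound is non-decreasing under the updates, the relative gain is non-negative and the loop ends once an iteration improves the bound by at most 0.1 percent.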


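Theorem 3. Let $p(\Theta)$ be constructed by the SB priors defined in (8) with hyper-parameters $(\sigma, \eta, \rho)$;
then iterative application of the following updates leads to a monotonic increase of (5), until
convergence to a maximum. The updates of $\{q_t^k\}$ are

$$q_t^k(z_{n,0:t}^k) = \nu_t^k \, p(z_{n,0:t}^k | o_{n,1:t}^k, a_{n,0:t}^k, \tilde{\Theta}_n), \quad \forall n, t, k, \qquad (9)$$

where $\nu_t^k$ is computed using (6) with $\Theta$ replaced by $\tilde{\Theta} = \{\tilde{\pi}, \tilde{\mu}, \tilde{W}\}$, a set of under-normalized
probability (mass) functions, with $\tilde{\pi}_{n,i}^a = e^{\langle \ln \pi_{n,i}^a \rangle_{p(\pi | \hat{\rho})}}$ and $\tilde{W}_{n,a,o}^{i,j} = e^{\langle \ln W_{n,a,o}^{i,j} \rangle_{p(W | \hat{\sigma}, \hat{\eta})}}$, and $\langle \cdot \rangle_p$
denotes expectations of $\cdot$ with respect to distributions $p$. The hyper-parameters of the posterior
distribution are updated as

$$
\begin{aligned}
\hat{\sigma}_{n,a,o}^{i,j} &= \sigma_{n,a,o}^{i,j} + \zeta_{n,a,o}^{i,j}, \qquad \hat{\eta}_{n,a,o}^{i,j} = \eta_{n,a,o}^{i,j} + \sum_{l=j+1}^{|\mathcal{Z}_n|} \zeta_{n,a,o}^{i,l}, \\
\hat{\rho}_{n,i}^{a} &= \rho_{n,i}^{a} + \sum_{k=1}^{K} \sum_{t=0}^{T_k} \sum_{\tau=0}^{t} \frac{\nu_t^k}{K} \, \phi_{t,\tau}^{n,k}(i) \, I(a_{n,\tau}^k = a),
\end{aligned}
\qquad (10)
$$

with $\zeta_{n,a,o}^{i,j} = \sum_{k=1}^{K} \sum_{t=0}^{T_k} \sum_{\tau=1}^{t} \frac{\nu_t^k}{K} \, \xi_{t,\tau-1}^{n,k}(i, j) \, I(a_{n,\tau-1}^k = a, o_{n,\tau}^k = o)$, where $I(\cdot)$ is the indicator function, and
both $\xi_{t,\tau}^{n,k}$ and $\phi_{t,\tau}^{n,k}$ are marginals of $q_t^k(z_{n,0:t}^k)$, i.e.

$$\xi_{t,\tau}^{n,k}(i, j) = p(z_{n,\tau}^k = i, z_{n,\tau+1}^k = j | a_{n,0:t}^k, o_{n,1:t}^k, \tilde{\Theta}_n), \qquad (11)$$

$$\phi_{t,\tau}^{n,k}(i) = p(z_{n,\tau}^k = i | a_{n,0:t}^k, o_{n,1:t}^k, \tilde{\Theta}_n). \qquad (12)$$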


The update equations in Theorem 3 constitute the VB algorithm for learning variable-size
joint FSCs under SB priors with batch data. In particular, (9) is a policy-evaluation step where the
rewards are reweighted to reflect the improved marginal value of the new policy posterior updated
in the previous iteration, and (10) is a policy-improvement step where the reweighted rewards are
used to further improve the policy posterior. Both steps require (11), which is computed based on
the forward messages $\alpha_\tau^{n,k}(i) = p(z_{n,\tau}^k = i | a_{n,0:\tau}^k, o_{n,1:\tau}^k, \tilde{\Theta}_n)$ and the backward messages $\beta_{t,\tau}^{n,k}(i)$,
$\forall n, k, t, \tau$. The $(\alpha, \beta)$
are forward-backward messages. Their updating equations are derived in the appendix.
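
To illustrate the conjugate form of the policy-improvement step for one row of stick-breaking parameters, the sketch below assumes the standard update for Beta stick fractions under multinomial counts: each first parameter gains the (reward-weighted) count of its own node, and each second parameter gains the total count of all later nodes. The function name and the flat-list layout for a single (agent, action, observation, source node) row are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def update_stick_hyperparameters(
    sigma: Sequence[float],
    eta: Sequence[float],
    zeta: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Posterior Beta parameters for one truncated stick-breaking row.

    sigma_hat[j] = sigma[j] + zeta[j]
    eta_hat[j]   = eta[j] + sum of zeta[l] over l > j
    where zeta[j] is the reward-weighted transition count into node j.
    """
    if not (len(sigma) == len(eta) == len(zeta)):
        raise ValueError("sigma, eta and zeta must have the same length")
    sigma_hat = [s + z for s, z in zip(sigma, zeta)]
    eta_hat = [0.0] * len(zeta)
    tail = 0.0  # running sum of counts for nodes with a larger index
    for j in range(len(zeta) - 1, -1, -1):
        eta_hat[j] = eta[j] + tail
        tail += zeta[j]
    return sigma_hat, eta_hat
```

Accumulating the tail sum from the last node backward keeps the update linear in the truncation level rather than quadratic.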

To determine the number of controller nodes $\{|\mathcal{Z}_n|\}_{n=1}^{N}$, the occupancy of a node is computed by
checking if there is a positive reward assigned to it. For example, for action $a$ and node $i$, $\hat{\rho}_{n,i}^a - \rho_{n,i}^a$
is the reward being assigned. If this quantity is greater than zero, then node $i$ is visited. Summing
over all actions gives the value of node $i$. Hence $|\mathcal{Z}_n|$ can be computed based on the following
formula:

$$|\mathcal{Z}_n| = \sum_{i=1}^{\infty} I\left(\sum_{a=1}^{|\mathcal{A}_n|} (\hat{\rho}_{n,i}^a - \rho_{n,i}^a) > 0\right). \qquad (13)$$

The complete algorithm is described in Algorithm 1. Upon the convergence of Algorithm 1, point
estimates of the decentralized policies may be obtained by calculating the expectations $E[\pi_{n,i}^a]$
and $E[W_{n,a,o}^{i,j}]$ (see the appendix for details).
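
The node-counting rule can be illustrated with a short Python sketch. It assumes the prior and posterior action hyper-parameters of one agent are stored as nested lists indexed by node and then by action, up to the truncation level; the names and the optional tolerance argument are hypothetical.

```python
from typing import Sequence


def controller_size(rho_prior: Sequence[Sequence[float]],
                    rho_post: Sequence[Sequence[float]],
                    tol: float = 0.0) -> int:
    """Count nodes whose total assigned reward, summed over actions, exceeds `tol`.

    rho_prior[i][a] and rho_post[i][a] are the prior and posterior Dirichlet
    parameters for action a in node i.
    """
    occupied = 0
    for prior_row, post_row in zip(rho_prior, rho_post):
        assigned = sum(post - prior for prior, post in zip(prior_row, post_row))
        if assigned > tol:
            occupied += 1
    return occupied
```

In floating-point arithmetic a small positive `tol` may be preferable to an exact comparison with zero, which is a practical choice rather than something stated in the text.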


Table 1: Computational complexity of Algorithm 1.

Variable    Best case                                              Worst case
$\alpha$    $\Omega(N |Z|^2 K T)$                                  $O(N |Z|^2 K T)$
$\beta$     $\Omega(N |Z|^2 K T)$                                  $O(N |Z|^2 K T^2)$
$\nu$       $\Omega(K)$                                            $O(K T)$
$\Theta$    $\Omega(N |Z|^2 K T)$                                  $O(N |Z|^2 K T^2)$
LB          $\Omega(|Z|^2 \sum_{n=1}^{N} |A_n| |O_n|)$             $O(|Z|^2 \sum_{n=1}^{N} |A_n| |O_n|)$


3.3 Computational complexity 

The time complexity of Algorithm 1 for each iteration is summarized in Table 1, assuming the
length of an episode is on the order of magnitude of $T$, and the number of nodes per controller is on
the order of magnitude of $|Z|$. In Table 1, the worst case refers to when there is a nonzero reward
at every time step of an episode (dense rewards), while the best case is when a nonzero reward is
received only at the terminal step. Hence, in general, the algorithm scales linearly with the number
of episodes and the number of agents. The time dependency on $T$ is between linear and quadratic.
In any case, the computational complexity of Algorithm 1 is independent of the number of states,
making it scalable to large problems.

3.4 Exploration and Exploitation Tradeoff 

Algorithm 1 assumes off-policy batch learning where trajectories are collected using a separate
behavior policy. This is appropriate when data has been generated from real-world or simulated
experiences without any input from the learning algorithm (e.g., learning from demonstration).
Off-policy learning is efficient if the behavior policy is close to optimal, as in the case when expert
information is available to guide the agents. With a random behavior policy, it may take a long time
for the policy to converge to optimality; in this case, the agents may want to exploit the policies
learned so far to speed up the learning process.

An important issue concerns keeping a proper balance between exploration and exploitation
to prevent premature convergence to a suboptimal policy, while still allowing the algorithm to learn quickly.
Since the execution of Dec-POMDP policies is decentralized, it is difficult to design an efficient
exploration strategy that guarantees optimality. The method in [35] counts the visiting frequency of FSC nodes
and applies an upper-confidence-bound style heuristic to select the next controller nodes, and uses an $\epsilon$-greedy
strategy to select actions. However, $\epsilon$-greedy might be sample inefficient. The work in [6] proposed a distributed
learning approach where agents take turns to learn the best response to each other's policies. This
framework applies an R-max type of heuristic, using the counts of trajectories to distinguish known
and unknown histories, to trade off exploration and exploitation. However, this method is confined
to tree-based policies in finite-horizon problems, and requires synchronized multi-agent learning.

To better accommodate our Bayesian policy learning framework for RL in infinite-horizon
Dec-POMDPs, we define an auxiliary FSC, $\Omega_n = \langle \mathcal{Y}, \mathcal{O}_n, \mathcal{Z}_n, W_n, \mu_n, \varphi_n \rangle$, to represent the policy
of each agent in balancing exploration and exploitation. To avoid confusion, we refer to $\Theta_n$ as a
primary FSC. The only two components distinguishing $\Omega_n$ from $\Theta_n$ are $\mathcal{Y}$ and $\varphi_n$, where $\mathcal{Y} = \{0, 1\}$
encodes exploration ($y = 1$) or exploitation ($y = 0$), and $\varphi_n = \{\varphi_{n,z}^y\}$ with $\varphi_{n,z}^y$ denoting the
probability of agent $n$ choosing $y$ in $z$. One can express $p(y_{n,t} | h_{n,t}, \Omega_n)$ in the same way as one
expresses $p(a_{n,t} | h_{n,t}, \Theta_n)$ (which is described in section 2.1). The behavior policy $\Pi_n$ of agent $n$ is
given as

$$p^{\Pi_n}(a | h, \Theta_n, \Omega_n) = \sum_{y=0,1} p(a | y, h) \, p(y | h, \Omega_n), \qquad (14)$$

where $p(a | y = 0, h) = p(a | h, \Theta_n)$ is the primary FSC policy, and $p(a | y = 1, h)$ is the exploration
policy of agent $n$, which is usually a uniform distribution.
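
The mixture in (14) can be written out directly. The following minimal Python sketch assumes a uniform exploration policy and takes the exploration probability $p(y = 1 | h, \Omega_n)$ as a plain number; both the function name and this simplification are illustrative.

```python
from typing import List, Sequence


def behavior_policy(primary_probs: Sequence[float], explore_prob: float) -> List[float]:
    """Mix the primary FSC action distribution with uniform exploration.

    primary_probs[a]  p(a | h, Theta_n), the exploitation policy (y = 0)
    explore_prob      p(y = 1 | h, Omega_n), the probability of exploring
    """
    if not 0.0 <= explore_prob <= 1.0:
        raise ValueError("explore_prob must lie in [0, 1]")
    num_actions = len(primary_probs)
    uniform = 1.0 / num_actions  # exploration policy p(a | y = 1, h)
    return [
        (1.0 - explore_prob) * p + explore_prob * uniform
        for p in primary_probs
    ]
```

As long as the exploration probability is positive, every action keeps a positive probability under this mixture, which is the condition placed on behavior policies in the definition of the global empirical value function.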

The behavior policy in (14) has achieved significant success in the single-agent case [13, 27,
22]. Here we extend it to the multi-agent case (centralized learning and decentralized
exploration/execution) and provide empirical evaluation in the next section.


4 Experiments 

The performance of the proposed algorithm is evaluated on five benchmark problems [1] and a
large-scale problem (traffic control) [35]. The experimental procedure in [35] was used for all the
results reported here. For Dec-SBPR, the hyperparameters in (8) are set to $c = 0.1$ and $d = 10^{-6}$
to promote sparse usage of FSC nodes.⁴ The policies are initialized as FSCs converted from the
episodes with the highest rewards using a method similar to [35].

Learning variable-size FSC vs. learning fixed-size FSC. To demonstrate the advantage of learning
variable-size FSCs, Dec-SBPR is compared with an implementation of the previous EM
algorithm [35]. The comparison is for the Mars Rover problem using $K = 300$ episodes⁵ to learn
the FSCs and evaluating the policy by the discounted accumulated reward averaged over 100 test
episodes of 1000 steps. Here, we consider off-policy learning and apply a semi-random policy to
collect samples. Specifically, the learning agent is allowed access to episodes collected by taking
actions according to a POMDP algorithm (point-based value iteration (PBVI) [29]). Let $\epsilon$ be the
probability that the agents follow the PBVI policy and $1 - \epsilon$ be the probability that the agents take
random actions. This procedure mimics the approach used in previous work [35]. The results with
$\epsilon = 0.3$ are reported in Figure 1, which shows the exact value and computation time as a function
of the number of controller nodes $|Z|$. As expected, for the EM algorithm, when $|Z|$ is too small,
the FSCs cannot represent the optimal policy (under-fitting), and when the number of nodes is too
large, the FSCs overfit a limited amount of data and perform poorly. Even if $|Z|$ is set to the number
inferred by Dec-SBPR, EM can still suffer severely from initialization and local maxima issues,
as can be seen from the large error bars. By setting a high truncation level ($|Z| = 50$), Dec-SBPR
employs Algorithm 1 to integrate out the uncertainty of the policy representation (under the SB
prior). As a result, Dec-SBPR can infer both the number of nodes that is needed ($\approx 5$) and optimal
controller parameters simultaneously. Furthermore, this inference is done with less computation
time and with a higher value and improved robustness (low variance of test value) than EM.

4 These values were chosen for testing, but our approach is robust to other values of $c$ and $d$.

5 Using a smaller training sample size $K$, our method can still perform robustly, as shown in the appendix.

Figure 1: Comparison between the variable-size controller learned by Dec-SBPR and fixed-size controllers
learned by EM on the Mars Rover problem ($|S| = 256$, $|A| = 6$, $|O| = 8$). Left: testing value; Right: averaged
computation time. Although EM has lower computational complexity per iteration than our VB algorithm,
empirically the VB algorithm uses less time to reach convergence. Moreover, the average values obtained using
the SB prior and its special Dirichlet instance (DP) are close to the best result of EM (dotted sky-blue line),
but are much better than the average results of EM (solid black line). Using the SB prior achieves slightly
better performance than the DP, which can be explained by the flexibility of the SB prior, as discussed in the
last paragraph of section 3.1.

Table 2: Performance of Dec-SBPR on benchmark problems compared to other state-of-the-art algorithms.
The table shows policy values (higher is better) and CPU times of all algorithms, and the
average controller size $|Z|$ inferred by Dec-SBPR.

                           Policy learning (unknown model)                                                     Planning (known model)
                           Dec-SBPR (fixed iteration)  Dec-SBPR (fixed time)       MCEM                        PeriEM                      FB-HSVI
Problem (|S|, |A|, |O|)    Value    |Z|      Time      Value    |Z|      Time      Value    |Z|      Time      Value    |Z|      Time      Value    |Z|      Time
Dec-Tiger (2, 3, 3)        -18.63   6        96s       -19.42   8        20s       -32.31   3        20s       9.42     7 x 10   6540s     13.45    52       6.0s
Broadcast (4, 2, 5)        9.20     2        7s        9.27     2        24s       9.15     3        24s       -        -        -         9.27     102      19.8s
Recycling Robots (3, 3, 2) 31.26    3        147s      25.16    2        19s       30.78    3        19s       31.80    6 x 10   272s      31.93    108      0s
Box Pushing (100, 4, 5)    77.65    14       290s      58.27    9        32s       59.95    3        32s       106.68   4 x 10   7164s     224.43   331      1715.1s
Mars Rovers (256, 6, 8)    20.62    5        1286s     15.2     6        160s      8.16     3        160s      18.13    3 x 10   7132s     26.94    136      74.31s
Figure 2: Performance comparison on the traffic control problem ($10^{20}$ states and 100 agents). Test reward (left)
and inferred controller size (right) of Dec-SBPR, as a function of the number of algorithmic iterations.


Comparison with other methods The performance of Dec-SBPR is also compared to several 
state-of-art methods, including: Monte Carlo EM (MCEM) If35ll . Similar to Dec-SBPR, MCEM is a 


and follow the same experimental procedure in [f35ll to report the result^j The rewards after running 
a fixed number of iterations and a fixed amount of time are summarized (respectively) in Table [2] 
(the first column under policy-learning category). Dec-SBPR is shown to achieve better policy 
values than MCEM on all problems Q These results can be explained by the fact that EM is (more) 
sensitive to initialization and (more) prone to local optima. Moreover, by fixing the size of the 
controllers, the optimal policy from EM algorithms might be over/under fitted . By using a Bayesian 
nonparametric prior, Dec-SBPR learns the policy with variable-size controllers, allowing more 
flexibility for representing the optimal policy. We also show the result of Dec-SBPR running the 
same amount of clock time as MCEM (Dec-SBPR (fixed time)), which indicates Dec-SBPR can 
achieve a better trade-off between policy value and learning time than MCEM. 

Finally, Dec-SBPR is compared to Periodic EM (PeriEM) E71 and FB-HSVI [|T5l . two state-of- 
art planing methods (with known models) for generating controllers. Because having a Dec-POMDP 
model allows more accurate value function calculations than a finite number of trajectories, the value 
of PeriEM and FB-HSVI are treated as upper-bounds for the policy-based methods. Our Dec-SBPR 
approach can sometimes outperform PeriEM, but produces lower value than FB-HSVI. FB-HSVI 
is a boundedly-optimal method, showing that Dec-SBPR can produce near optimal solutions in 
some of these problems and produces solutions that are much closer to the optimal than previous 
RL methods. It is also worth noting that neither PeriEM nor FB-HSVI can scale to large problems 
(such as the one discussed below), while by using a policy-based RL approach, Dec-SBPR can scale 
well. 

Scaling up to larger domains To demonstrate scalability to both large problem sizes and large 
numbers of agents, we test our algorithm on a traffic problem [|35ll . with 10 20 states. Here, there 
are 100 agents controlling the traffic flow at 10 x 10 intersections with one agent located at each 
intersection. Except for MCEM, no previous Dec-POMDPs algorithms are able to solve such large 

6 The learning curves of Dec-SBPR are shown in the appendix. 

7 The results are provided by personal communication with its authors and run on the same benchmarks that are 
available online. 


policy-based RL approach. We apply the exploration-exploitation strategy described in section 3.4 
problems. 

Since the authors of [35] use a hand-coded policy (comparing the traffic flow between two 
directions) as a heuristic for generating training trajectories, we also use such a heuristic for a 
fair comparison. In addition, to examine the effectiveness of the exploration-exploitation strategy 
described in Section 3.4, we also consider the case where the initial behavior policy is random and 
is then optimized as discussed. From Figure 2, we can see that, with the help of the heuristic, 
Dec-SBPR achieves the best performance. Without the heuristic (using only our 
exploration-exploitation strategy), Dec-SBPR is able to produce a higher-quality 
policy than MCEM within a few iterations. Moreover, the inferred number of FSC nodes (averaged over all agents) is 
smaller than the number preselected by MCEM. This shows that not only can Dec-SBPR scale to 
large problems, but it can also produce higher-quality solutions than other methods for those large 
problems. 


5 Conclusions 

This paper presented a scalable Bayesian nonparametric policy representation and an associated 
learning framework (Dec-SBPR) for generating decentralized policies in Dec-POMDPs. A new 
exploration-exploitation method, which extends the popular ε-greedy method, was also provided 
for reinforcement learning in Dec-POMDPs. Experimental results show that Dec-SBPR produces 
higher-quality solutions than the state-of-the-art policy-based method, and has the additional benefit of 
inferring the number of nodes needed to represent the optimal policy. The resulting method is also 
scalable to large domains (in terms of both the number of agents and the problem size), allowing 
high-quality policies for large Dec-POMDPs to be learned efficiently from data. 
Appendices 

A Proof of Theorem 3: (Mean-field) Variational Bayesian (VB) Inference for 
DEC-SBPR 

Under the standard variational theory [11], minimizing the KL divergence between $q(\Theta, z)$ and 
$p(\Theta, z \mid D)$ is equivalent to maximizing a lower bound of the log marginal likelihood (the empirical value 
function in our case). Using Jensen's inequality, we can obtain the following lower bound of the 
log marginal value function 


K T k 


\nV(V {K) ) = >"*EE E Itf&Mmri) 


k=1 t=0 2 * 

x - , n ? 


rm®)p(y)p(ait, 9) 
dΘ dη 
resentations. 


To derive the VB updating equations, we rewrite the lower bound in equation (15) as follows 

The VB inference algorithm for DEC-SBPR is based on maximizing LB w.r.t. the distribution of 
the joint DEC-SBPR parameters $\{\{q_t^k(z^k_{n,0:t})\}_{k,t},\, q(\Theta_n),\, q(\eta_n)\}_{n=1}^{N}$, which can be achieved by 
alternating the following steps. 


Update the distribution of nodes (VB E-step): Keeping $\{q(\Theta_n)\}_{n=1,\dots,N}$ and $\{q(\eta_n)\}_{n=1,\dots,N}$ 
which is solved to give the distribution of nodes $z^k_{n,0:t}$ for the $n$-th agent 
where $\psi(\cdot)$ is the digamma function, $\omega^{n,a,o}_{i,j}$ is the reward (soft count) allocated to the transition from 
state $i$ to $j$, and both $\hat{c}^{\,n,a,o}_{i,j}$ and $\hat{d}^{\,n,a,o}_{i,j}$ are the posterior parameters of $\eta^{n,a,o}_{i,j}$. 

Update the sufficient statistics (VB-M step): In the VB-M step, the distributions of nodes 
$q_t^k(z^k_{n,0:t})$, $\forall k, t$, are fixed, and the objective is to solve 
$\max_{q(\Theta_n),\, q(\eta_n)} \mathrm{LB}\big(\{q_t^k(z^k_{n,0:t})\}_{k,t},\, q(\Theta_n),\, q(\eta_n)\big)$, $\forall n$, 
subject to the normalization constraint that $\int q(\cdot)\, d(\cdot) = 1$. First we consider finding $q(\Theta_n)$. To that 
end, we construct the Lagrangian 
and v* is the marginalized re-weighted reward computed by equation (6). 
To find $q(\eta)$, we construct the Lagrangian 

$$F_{q(\eta)} = \mathrm{LB}\big(\{q_t^k\},\, q(\Theta),\, q(\eta)\big) - \lambda\Big(1 - \int q(\eta)\, d\eta\Big) \tag{23}$$


r)V 1 r i 

—-^yy = — J q(Q) In p{r])p{G)dG - — In q(rj) + A = 0 


( 24 ) 


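$$
\begin{aligned}
q(\eta) &\propto \exp\Big\{\int q(\Theta)\,\ln\big[p(\eta)\,p(\Theta \mid \eta)\big]\, d\Theta\Big\} \\
&\propto \exp\Big\{\int q(\Theta)\,\ln p(\Theta \mid \eta)\, d\Theta + \ln p(\eta)\Big\} \\
&\propto p(\eta)\,\exp\big\{\langle \ln p(\Theta \mid \eta) \rangle\big\}
\end{aligned}
$$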

N \A n \ \@n\ \%n\ \^n\ p/ 71,a,O , 71,Cl,0\ n ,a 

=n n n n n <m<t; * <o i 


(25) 


(i - y. n ’ a ’°)<f' 


-1 


r( n,a, 0 )r( n,a,ox 
71=1 0=1 j=l i=l j=l V M ' V 'h3 > 

N \A n \ \On | |-^n| |-2 n | p/ n,a,o | n,a,o\ 

n n n n n r ( „..^ wn c " «p i - <r<<* - mi - v"”))} 

n=l a=l j'=l 7=1 j=l ' '*>1 


One can set $\sigma^{n,a,o}_{i,j} = 1$; in this case, the VB approximation of $q(\eta)$ is a product of independent 
gamma distributions. However, when $\sigma^{n,a,o}_{i,j} \neq 1$, (24) is no longer a gamma distribution (the prior 
and likelihood are not conjugate). To solve this issue, one might consider the VB inference method 
for non-conjugate priors [33], by which we consider a point estimate of $\eta$ such that $q(\eta)$ is maximized. 
One way to obtain the maximum estimate of $\eta$ is to solve $\partial q(\eta)/\partial \eta = 0$; however, this operation involves 
taking derivatives of gamma functions, which does not have a simple closed-form solution. To circumvent 
this difficulty, we use grid search. To make the search more efficient, we use the bounds of the gamma-function ratio $\Gamma(\sigma + \eta)/\Gamma(\eta)$ to 
give an initial estimate of the search range. The bounds are from Wendel's double inequality, 

$$x\,(x + a)^{a-1} \le \frac{\Gamma(a + x)}{\Gamma(x)} \le x^{a}, \tag{26}$$

where $x > 0$ and $0 < a < 1$. 
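
As a rough illustration of the two ingredients mentioned above, the following Python sketch checks Wendel's double inequality numerically and runs a plain one-dimensional grid search. It is a minimal sketch, not the authors' implementation: the function names, the grid limits, and the toy objective `log_q_eta` (one Beta($\sigma$, $\eta$) weight with a Gamma($c$, $d$) prior on $\eta$) are assumptions made for this example only.

```python
import math


def gamma_ratio(a, x):
    """Gamma(a + x) / Gamma(x), computed through log-gamma for stability."""
    return math.exp(math.lgamma(a + x) - math.lgamma(x))


def wendel_bounds(a, x):
    """Lower and upper Wendel bounds on Gamma(a + x) / Gamma(x)."""
    if not (0.0 < a < 1.0 and x > 0.0):
        raise ValueError("bounds are used here only for 0 < a < 1 and x > 0")
    return x * (x + a) ** (a - 1.0), x ** a


def log_q_eta(eta, sigma, c, d, mean_log_one_minus_v):
    """Unnormalised log-density of eta in a toy single-weight model.

    Assumed form (illustrative only): Gamma(c, d) prior on eta times the
    eta-dependent part of a Beta(sigma, eta) density for one weight V.
    """
    log_prior = (c - 1.0) * math.log(eta) - d * eta
    log_beta_norm = (math.lgamma(sigma + eta) - math.lgamma(sigma)
                     - math.lgamma(eta))
    return log_prior + log_beta_norm + (eta - 1.0) * mean_log_one_minus_v


def grid_argmax(objective, low, high, num_points=201):
    """Return the grid point in [low, high] that maximises the objective."""
    step = (high - low) / (num_points - 1)
    grid = [low + k * step for k in range(num_points)]
    return max(grid, key=objective)


# Point estimate of eta for hypothetical values of sigma, c, d and <ln(1 - V)>.
eta_hat = grid_argmax(lambda eta: log_q_eta(eta, 0.5, 2.0, 1.0, -0.7), 0.05, 10.0)
```

The grid resolution trades accuracy for cost; a narrower initial range means fewer grid points are needed for the same precision.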


B Some Basics of Stick-breaking Priors 

Here we provide the definition of the stick-breaking prior (SBP), its connection to the generalized Dirichlet 
distribution and the corresponding posterior inference, as well as the main statistical characteristics 
that are useful for developing the inference methods in our paper. For a more detailed mathematical 
treatment, readers are referred to [18, 34]. 

Definition 4. The stick-breaking priors [18] are almost surely discrete random probability measures 
$\mathcal{P}$ over the measurable space $(\Omega, \mathcal{B})$, which is partitioned into $d$ disjoint regions with $\Omega = \bigcup_k B_k$ 
for $k = 1, \dots, d$. It is expressed as 

$$\mathcal{P}(\cdot) = \sum_{k=1}^{d} p_k\, \delta_{Z_k}(\cdot), \tag{27}$$

and 

$$p_1 = V_1 \quad \text{and} \quad p_i = (1 - V_1)(1 - V_2) \cdots (1 - V_{i-1})\, V_i, \quad i \ge 2 \tag{28}$$

are the weights, where the $V_i$ are independent $\mathrm{Beta}(a_i, b_i)$ random variables for $a_i, b_i > 0$. 
SBP allows the Beta-distributed RVs $V_1, \dots, V_d$ and the atoms $Z_k$ associated with the resulting 
weights to be drawn simultaneously. 
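
The weight construction in the definition can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the parameter values in the usage lines are assumptions, and the atoms are not modelled. Each Beta draw takes a fraction of whatever stick length remains.

```python
import random


def stick_breaking_weights(v_draws):
    """Map draws V_1..V_d to weights p_1..p_d and the leftover stick length.

    p_1 = V_1 and p_i = (1 - V_1) ... (1 - V_{i-1}) V_i for i >= 2.
    """
    weights = []
    remaining = 1.0  # product of (1 - V_j) over the sticks broken so far
    for v in v_draws:
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights, remaining


def sample_weights(a, b, rng):
    """Draw V_i ~ Beta(a_i, b_i) independently and convert them to weights."""
    v_draws = [rng.betavariate(a_i, b_i) for a_i, b_i in zip(a, b)]
    return stick_breaking_weights(v_draws)


# Usage with hypothetical parameters: five sticks, each with V_i ~ Beta(1, 2).
rng = random.Random(0)
weights, leftover = sample_weights([1.0] * 5, [2.0] * 5, rng)
```

The leftover length is the mass not yet assigned after $d$ breaks; it corresponds to $p_{d+1}$ when the process is truncated at level $d$.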


B.1 Stick-breaking prior and generalized Dirichlet distribution 


We denote $p \sim \mathrm{SB}(v, w)$ as constructing $p$ as an infinite process ($d \to \infty$) as in (28), and $p \sim 
\mathrm{GDD}(v, w)$ when $p$ is finite. Here, GDD stands for generalized Dirichlet distribution. To see the 
connection between SBP and GDD, set the truncation level (number of occupied states) to $d$ with 
$p_{d+1} = 1 - \sum_{i=1}^{d} p_i$; we can then write down the density function of $V = (V_1, \dots, V_d)$ as 

$$f(V) = \prod_{i=1}^{d} f(V_i) = \prod_{i=1}^{d} \frac{\Gamma(v_i + w_i)}{\Gamma(v_i)\,\Gamma(w_i)}\, V_i^{v_i - 1} (1 - V_i)^{w_i - 1}. \tag{29}$$

By changing variables from $V$ to $p$ and using the relation between $V$ and $p$ as described by (28), 
the density of $p$ can be obtained as follows, 
When $w_i = \sum_{j=i+1}^{d} v_j + w_d$ for $i < d$ (equivalently, $w_i = v_{i+1} + w_{i+1}$), the GDD is equivalent to the standard 
Dirichlet distribution. 
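
The reduction to the Dirichlet distribution can be checked numerically. The sketch below assumes the usual Connor-Mosimann form of the generalized Dirichlet density, $f(p) = \prod_{i=1}^{d} \frac{\Gamma(v_i + w_i)}{\Gamma(v_i)\Gamma(w_i)}\, p_i^{v_i - 1} \big(1 - \sum_{j \le i} p_j\big)^{\gamma_i}$ with $\gamma_i = w_i - v_{i+1} - w_{i+1}$ for $i < d$ and $\gamma_d = w_d - 1$. The function names and the numerical values are illustrative assumptions, not taken from the paper.

```python
import math


def gdd_logpdf(p, v, w):
    """Log-density of the generalized Dirichlet at p = (p_1, ..., p_d)."""
    d = len(v)
    total = 0.0
    tail = 1.0  # running value of 1 - p_1 - ... - p_i
    for i in range(d):
        tail -= p[i]
        log_norm = (math.lgamma(v[i] + w[i]) - math.lgamma(v[i])
                    - math.lgamma(w[i]))
        if i < d - 1:
            exponent = w[i] - v[i + 1] - w[i + 1]
        else:
            exponent = w[i] - 1.0
        total += log_norm + (v[i] - 1.0) * math.log(p[i]) + exponent * math.log(tail)
    return total


def dirichlet_logpdf(p_full, alpha):
    """Log-density of a standard Dirichlet at the full probability vector."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(x) for a, x in zip(alpha, p_full))


# Hypothetical parameters with w_i = v_{i+1} + w_{i+1} for i < d.
v = [2.0, 3.0, 1.5]
w = [7.0, 4.0, 2.5]          # 4.0 = 1.5 + 2.5 and 7.0 = 3.0 + 4.0
p = [0.2, 0.3, 0.1]          # the remaining mass is p_4 = 0.4
gdd_value = gdd_logpdf(p, v, w)
dirichlet_value = dirichlet_logpdf(p + [1.0 - sum(p)], v + [w[-1]])
```

Under this parameter choice the two log-densities agree up to rounding, with the Dirichlet parameters given by $(v_1, \dots, v_d, w_d)$.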

As a concrete example, consider the case d = 3, we have 
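Plugging these relations into (30), we can obtain 

$$
\begin{aligned}
f(p) &= \left| \frac{\partial V}{\partial p} \right| f(V) \\
&= \frac{1}{(1 - p_1)(1 - p_1 - p_2)} \prod_{i=1}^{3} \frac{\Gamma(v_i + w_i)}{\Gamma(v_i)\,\Gamma(w_i)}\; p_1^{v_1 - 1} (1 - p_1)^{w_1 - 1} \\
&\quad \times \Big( \frac{p_2}{1 - p_1} \Big)^{v_2 - 1} \Big( 1 - \frac{p_2}{1 - p_1} \Big)^{w_2 - 1} \Big( \frac{p_3}{1 - p_1 - p_2} \Big)^{v_3 - 1} \Big( 1 - \frac{p_3}{1 - p_1 - p_2} \Big)^{w_3 - 1} \\
&= \prod_{i=1}^{3} \frac{\Gamma(v_i + w_i)}{\Gamma(v_i)\,\Gamma(w_i)}\; p_i^{v_i - 1}\; (1 - p_1)^{w_1 - (v_2 + w_2)}\, (1 - p_1 - p_2)^{w_2 - (v_3 + w_3)}\; p_4^{w_3 - 1},
\end{aligned}
\tag{32}
$$

where $p_4 = 1 - p_1 - p_2 - p_3$. 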


B.2 Bayesian Inference for GDD 

Given a set of discrete observations $\{X_n\}_{n=1}^{N} \sim \mathrm{Discrete}(p)$ and the prior $p \sim \mathrm{SB}(v, w)$, the 
posterior of $p$ can be written down as 

$$p \mid \{X_n\}_{n=1}^{N} \sim \mathrm{GDD}(v', w'), \tag{33}$$

where the posterior hyper-parameters are updated as $v'_i = v_i + \sum_{n=1}^{N} \mathbb{I}(X_n = i)$ and $w'_i = 
w_i + \sum_{j > i} \sum_{n=1}^{N} \mathbb{I}(X_n = j)$, where $\mathbb{I}(\cdot)$ is an indicator function with value equal to one when the 
argument is true and zero otherwise. 
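
The hyper-parameter update amounts to simple counting, as the following Python sketch shows. It is a sketch under stated assumptions: the function name is hypothetical, observations are assumed to be integer labels in $1, \dots, d+1$, and the label $d+1$ stands for the leftover mass beyond the truncation level.

```python
def gdd_posterior(v, w, observations):
    """Update (v, w) from discrete observations with labels in 1..d+1.

    v'_i = v_i + #{n : X_n = i}
    w'_i = w_i + #{n : X_n > i}
    """
    d = len(v)
    counts = [0] * (d + 1)
    for x in observations:
        counts[x - 1] += 1
    v_post = [v[i] + counts[i] for i in range(d)]
    w_post = [w[i] + sum(counts[i + 1:]) for i in range(d)]
    return v_post, w_post


# Usage with hypothetical prior parameters and six observations.
v_post, w_post = gdd_posterior([1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1, 1, 2, 3, 3, 3])
```

Each observation of category $i$ raises $v_i$ by one and raises $w_j$ by one for every $j < i$, reflecting that the stick had to "pass" all earlier breaks to reach category $i$.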


C The Computation of Forward and Backward Variables ($\alpha$, $\beta$) 


The forward variable $\alpha^{k}_{\tau}(i) = p(z^{k}_{\tau} = i \mid a^{k}_{0:\tau}, o^{k}_{1:\tau}, \Theta_n)$ and the backward variable $\beta^{k}_{t,\tau}(i)$ are 
computed recursively by each agent using (34)-(36). 
D Additional Experimental Results 

D.1 Learning variable-size FSC vs. learning fixed-size FSC 

Additional experiments are added to study the impact of the number of training samples. 
Figure 3: Comparison between the variable-size controller learned by Dec-SBPR and the fixed-size controller 
learned by the EM algorithm. Top: testing value of policies learned using different numbers of training 
episodes (K = 30 (left), K = 100 (middle), K = 300 (right)); Bottom: averaged computation time of EM (left) 
and SB (right). 


D.2 Sequential Batch Learning with Exploration-Exploitation Trade-offs 

Additional results from sequential batch learning with exploration-exploitation trade-offs for five 
domains are plotted in Figure 4. In each iteration, a batch of samples is collected with the updated 
behavior policies⁸ and is used to learn a set of new policies with Algorithm 1. These results are 
associated with the numbers reported in the first two columns of Table 2 in the main body of the 
paper. 


Tiger (|S| = 2, |A| = 3, |O| = 3) 

8 To generate these plots, 50 trajectories are collected in each iteration and the exploration parameter is set to 
u = 100. 
Broadcast (|S| = 4, |A| = 2, |O| = 5) 
Mars Rover (|S| = 256, |A| = 6, |O| = 8) 
Figure 4: Additional plots illustrating the exploration-exploitation trade-off, including testing value (left), 
inferred controller numbers (middle) and exploration rate (right). In each iteration, a batch of samples is 
collected with the updated behavior policies and is used to learn a set of new policies with Algorithm 1. 